EDA Diabetes Pima Indians¶

An EDA will be conducted on this dataset so that we can see any trends and correlations within our data. This will aid us in forming our hypotheses and subsequently training our machine learning models accordingly.

Project Goal (Objective):¶

The goal of this project is to develop a machine learning model that can predict whether a patient is likely to be diagnosed with diabetes based on diagnostic measurements. The model should take into account features such as glucose level, BMI, age, and others to classify patients into diabetic or non-diabetic categories.

Motivation:¶

Diabetes is a chronic condition that can lead to serious health complications if left undetected and untreated. Early prediction enables:

  • Preventive care and lifestyle interventions
  • Reduction in healthcare costs
  • Improved quality of life for at-risk individuals

By using machine learning, we aim to:

  • Identify patterns in diagnostic data that signal risk of diabetes
  • Assist healthcare professionals in decision support
  • Gain insights into which features are most predictive of diabetes

About Our Dataset¶

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. The Outcome variable is 0 for Non-Diabetic and 1 for Diabetic.

Dataset Overview:¶

Rows: 768 patients

Columns: 9 features

Target Variable: Outcome (1 = Diabetes, 0 = No Diabetes)

Columns¶

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skinfold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age in years
  • Outcome: Class variable (0 or 1)
In [16]:
# importing the necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import scipy as sp
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline 
In [20]:
# reading our data
file_path = r'C:\Users\norit\OneDrive\Desktop\Machine learning\diabetes.csv'
df = pd.read_csv(file_path)
df
Out[20]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0

768 rows × 9 columns

In [22]:
# getting statistical info from our dataset
df.describe()
Out[22]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
In [24]:
#Getting information of the data type of our columns and its contents
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [26]:
# getting the shape of our data
df.shape
Out[26]:
(768, 9)
In [30]:
# finding out if we have null values
df.isnull().sum()
Out[30]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
In [32]:
# checking the Distribution of our various columns
p = df.hist(bins=20, figsize= (20,20))
[Figure: histograms showing the distribution of each column]

Cleaning our dataset¶

From the histograms above, which portray how the values in the columns are distributed, we discover some abnormalities. Certain columns contain zero values that make no medical sense:

  • Glucose
  • BloodPressure
  • SkinThickness
  • Insulin
  • BMI

In order to get better results for our analysis, we will replace the '0' values in these columns with NaN and then replace the NaNs with the median of the values in the respective columns.

In [36]:
#creating a new dataframe to continue our EDA

df_EDA = df.copy(deep = True)
df_EDA[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = df_EDA[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.NaN)

## showing the count of Nans
print(df_EDA.isnull().sum())
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
In [38]:
#replace NaN with the median for the purpose of EDA
#(note: in newer pandas, prefer df_EDA['Glucose'] = df_EDA['Glucose'].fillna(...)
# over inplace=True, which can fail silently under copy-on-write)

df_EDA['Glucose'].fillna(df_EDA['Glucose'].median(), inplace = True)
df_EDA['BloodPressure'].fillna(df_EDA['BloodPressure'].median(), inplace = True)
df_EDA['SkinThickness'].fillna(df_EDA['SkinThickness'].median(), inplace = True)
df_EDA['Insulin'].fillna(df_EDA['Insulin'].median(), inplace = True)
df_EDA['BMI'].fillna(df_EDA['BMI'].median(), inplace = True)
In [40]:
#plotting new distribution after the '0' has been replaced

p = df_EDA.hist(bins=20,figsize = (20,20))
[Figure: histograms of each column after the zeros were replaced]
In [42]:
#Finding the relationships between the various columns and how they affect an individual's chance of having diabetes

p=sns.pairplot(df_EDA, hue = 'Outcome')
[Figure: pairplot of all features, colored by Outcome]

Pairplot Findings

From the above pairplot we notice a linear relationship between BMI and Skin Thickness: the thicker an individual's skinfold, the higher her BMI, and this also comes with a higher chance of having diabetes.

In [48]:
#Plotting a scatter matrix to gather more info between the columns

from pandas.plotting import scatter_matrix
p=scatter_matrix(df_EDA,figsize=(25, 25))
[Figure: scatter matrix of all features]

Findings from scatter plot

As discovered earlier, we see a clear relationship between Skin Thickness and BMI. We also notice that there might be a slight relationship between Glucose and Insulin.

However, before concluding, we will draw a correlation matrix to get the exact correlations between these columns.

In [51]:
# Correlation Matrix

plt.figure(figsize = (12,10))

sns.heatmap(df_EDA.corr(), annot =True)
Out[51]:
<Axes: >
[Figure: annotated correlation heatmap of all columns]

Findings from correlation matrix

From the above correlation matrix, Glucose has the highest correlation with our label column 'Outcome', at 0.49. Let us use a violin plot to see this.

In [61]:
#Classifying the Glucose based on class
ax = sns.violinplot(x='Outcome', y='Glucose', data=df_EDA, palette='muted')
[Figure: violin plot of Glucose by Outcome]

Findings from Violin Plot

Observing the violin plot, we see a large vertical distance between the box plots for diabetics and non-diabetics. This indicates that Glucose can be a very important variable in model building.

Conclusion on our EDA and setting our Hypothesis¶

From our Exploratory Data Analysis, we can set the following hypotheses:

  • Diabetics seem to have a higher blood pressure than non-diabetics
  • BMI for diabetics is higher than BMI for non-diabetics
  • Diabetics seem to have a higher pedigree function than non-diabetics
  • Diabetics seem to have a higher level of Glucose than non-diabetics
  • It can be roughly hypothesized that Insulin for diabetics is lower than for non-diabetics
  • It can be observed that diabetic women had more pregnancies than non-diabetic women
  • Skin thickness for diabetics is greater than that of non-diabetics
  • The older a woman gets, the higher her chances of becoming diabetic

In conclusion:

We will not be deleting any of our features, as they all show some correlation with our label column 'Outcome'.

Preprocessing for Machine Learning Algorithms¶

Split into Test and Train Sets

In [69]:
# Examining and visualizing the distribution of the outcome variable in the data
f,ax=plt.subplots(1,2,figsize=(18,8))
df['Outcome'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('target')
ax[0].set_ylabel('')
sns.countplot(x='Outcome',data=df,ax=ax[1])
ax[1].set_title('Outcome')
plt.show()
[Figure: pie chart and count plot of the Outcome classes]

Findings

We can see that one class carries a lot more weight than the other. It is therefore important that, when splitting our data into training and testing sets, we stratify. This makes sure that the class distribution is the same in both sets.

In [72]:
#create a new dataframe that replaces the columns with '0' values with Nan in the columns below:

df_NAN = df.copy(deep = True)
df_NAN[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = df_NAN[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.NaN)

## showing the count of Nans
print(df_NAN.isnull().sum())
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
In [74]:
#convert pandas dataframe to numpy array 


df_ML=df_NAN.to_numpy()

#split into features and target
x = df_ML[:, 0:7]  # note: 0:7 keeps columns 0-6 and silently drops Age (index 7); 0:8 would keep all eight predictors
y = df_ML[:, 8]
In [76]:
#We must stratify while splitting into test and train
#We will stratify using our label/target column (y)

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=45, stratify=y)
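
As a quick sanity check (a sketch, not part of the original run), one can verify that stratification kept the class ratio in both splits:

# compare class proportions in the full label array and in each split;
# with stratify=y these should be nearly identical
for name, labels in [('full', y), ('train', y_train), ('test', y_test)]:
    print(name, np.bincount(labels.astype(int)) / len(labels))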

Imputing (replacing the NaN values with SimpleImputer, strategy: mean)

In [79]:
x_train
Out[79]:
array([[0.00e+00, 1.52e+02, 8.20e+01, ..., 2.72e+02, 4.15e+01, 2.70e-01],
       [3.00e+00, 1.74e+02, 5.80e+01, ..., 1.94e+02, 3.29e+01, 5.93e-01],
       [5.00e+00, 9.50e+01, 7.20e+01, ...,      nan, 3.77e+01, 3.70e-01],
       ...,
       [4.00e+00, 1.20e+02, 6.80e+01, ...,      nan, 2.96e+01, 7.09e-01],
       [3.00e+00, 1.02e+02, 4.40e+01, ..., 9.40e+01, 3.08e+01, 4.00e-01],
       [0.00e+00, 1.05e+02, 6.80e+01, ...,      nan, 2.00e+01, 2.36e-01]])
In [81]:
np.isnan(x_train)
Out[81]:
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ...,  True, False, False],
       ...,
       [False, False, False, ...,  True, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ...,  True, False, False]])
In [83]:
#Start with x_train

from sklearn.impute import SimpleImputer
imputer_1=SimpleImputer(missing_values=np.NaN,strategy="mean")


imputer_1.fit(x_train)
Out[83]:
SimpleImputer()
In [85]:
x_train=imputer_1.transform(x_train)
In [87]:
# checking whether we still have NaN values
np.isnan(x_train)
Out[87]:
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
In [89]:
# note: `np.nan in x_train` is unreliable (NaN != NaN, so it is always False);
# np.isnan(x_train).any() is the robust check
print(np.isnan(x_train).any())
False
In [91]:
#Now on x_test
#(note: re-fitting the imputer on x_test leaks test-set statistics; in practice,
# reuse the imputer fitted on x_train and only call transform here)

imputer_1.fit(x_test)
Out[91]:
SimpleImputer()
In [93]:
x_test=imputer_1.transform(x_test)
In [95]:
print(np.isnan(x_test).any())   # robust NaN check (see note above)
False

Scaling

In [98]:
#scaling x_train

from sklearn.preprocessing import StandardScaler

ss=StandardScaler()
ss.fit(x_train)
x_train=ss.transform(x_train)
In [100]:
#scaling x_test

from sklearn.preprocessing import StandardScaler

#(note: fitting a fresh scaler on x_test leaks test-set statistics; in practice,
# reuse the scaler fitted on x_train and only call ss.transform(x_test))
ss=StandardScaler()
ss.fit(x_test)
x_test=ss.transform(x_test)

Testing Our Machine Learning Algorithms¶

In order to find a good model to predict whether a patient will have diabetes or not, our expected score is 0.9. Because this dataset poses a classification problem, we will be using the following classification models:

  • Naive Bayes
  • Random Forest
  • KNN
  • Decision Trees
  • SVM
  • Logistic Regression

Naive Bayes Algorithm

The Naive Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems. The Naive Bayes classifier is one of the simplest and most effective classification algorithms, helping to build fast machine learning models that can make quick predictions. It is a probabilistic classifier, meaning it predicts on the basis of the probability of an object.

We have chosen to test this model for the following reasons:

  • Naive Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset
  • It is the most popular choice for text classification problems
  • It is suitable for use in medical data classification
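
Concretely, under the 'naive' conditional-independence assumption the classifier picks the class with the highest posterior probability,

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y),$$

where the GaussianNB used below models each $P(x_i \mid y)$ as a normal distribution.
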
In [111]:
#Using the GaussianNB

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x_train,y_train)
Out[111]:
GaussianNB()
In [113]:
#predict

y_pred=gnb.predict(x_test)
In [115]:
#Check for accuracy score on both test and train to check for overfitting

from sklearn.metrics import accuracy_score

print('Training set score: {0:0.4f}'. format(gnb.score(x_train,y_train)))
print('Testing set score: {0:0.4f}'. format(gnb.score(x_test,y_test)))
Training set score: 0.7573
Testing set score: 0.7857

Result 1

From the above results there appears to be no overfitting, as there is no big difference between our train and test accuracy scores.

To better understand the scores for the separate classes, we will do some more investigation below.

In [118]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error


print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Training Score:\n",gnb.score(x_train,y_train)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
Classification Report is:
               precision    recall  f1-score   support

         0.0       0.81      0.88      0.84       100
         1.0       0.73      0.61      0.67        54

    accuracy                           0.79       154
   macro avg       0.77      0.75      0.75       154
weighted avg       0.78      0.79      0.78       154

Confusion Matrix:
 [[88 12]
 [21 33]]
Training Score:
 75.7328990228013
Mean Squared Error:
 0.21428571428571427
R2 score is:
 0.05888888888888866

Result 2

This implies our model classified correctly 79% of the time.

The precision score (macro avg) stood at 0.77, implying our model placed observations with a high risk of diabetes in the high-risk category 77% of the time. The recall (macro avg) stood at 0.75.

We also have an F1 score of 0.75. The F1 score is the harmonic mean of precision and recall and assigns equal weight to both metrics. However, for our analysis it is relatively more important for the model to have few false negatives (it would be dangerous to classify high-risk patients in the low-risk category). Therefore, we look at precision and recall individually.
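
Since false negatives are the costly error here, it is worth reading off the recall of the diabetic class directly. A small sketch using the predictions above:

from sklearn.metrics import recall_score

# recall for the diabetic class (label 1): the share of true diabetics the model catches
print('Recall (class 1): {:.2f}'.format(recall_score(y_test, y_pred, pos_label=1)))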

Improving our model

*Using Cross Validation*

In [123]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(gnb, x_train, y_train, cv = 10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))
Cross-validation scores:[0.80645161 0.77419355 0.82258065 0.70967742 0.6557377  0.63934426
 0.75409836 0.7704918  0.80327869 0.78688525]
In [125]:
print('Average cross-validation score: {:.4f}'.format(scores.mean()))
Average cross-validation score: 0.7523

Result 3

After performing cross-validation on our model, we see no improvement, so we move on to parameter tuning with grid search.

Using Parameter Tuning with Grid search

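The cell that constructs grid_search is missing from this export. A minimal sketch of what it likely looked like (a reconstruction, not the original cell; the logarithmic spacing of the param_var_smoothing values printed below matches np.logspace(0, -9, num=100)):

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# hypothetical reconstruction of the missing grid-search cell
param_grid = {'var_smoothing': np.logspace(0, -9, num=100)}
grid_search = GridSearchCV(GaussianNB(), param_grid, cv=10, scoring='accuracy')
grid_search.fit(x_train, y_train)
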
In [141]:
# Print best parameters
print("Best Parameters:", grid_search.best_params_)

# Print best estimator
print("Best Estimator:", grid_search.best_estimator_)

# Print best score
print("Best Cross-Validation Accuracy:", grid_search.best_score_)
Best Parameters: {'var_smoothing': 0.23101297000831597}
Best Estimator: GaussianNB(var_smoothing=0.23101297000831597)
Best Cross-Validation Accuracy: 0.759016393442623
In [139]:
results = pd.DataFrame(grid_search.cv_results_)
print(results[['param_var_smoothing', 'mean_test_score', 'std_test_score']].head())
   param_var_smoothing  mean_test_score  std_test_score
0             1.000000         0.747581        0.028881
1             0.811131         0.745968        0.031982
2             0.657933         0.755751        0.036419
3             0.533670         0.754125        0.042252
4             0.432876         0.757377        0.036216

Result 4

The grid search did not offer any meaningfully better results.

CONCLUSION ON NAIVE BAYES

  • From our model we can conclude that we expect the model to be around 79% accurate
  • Our original model accuracy is 0.78, but the mean cross-validation accuracy is 0.75. So, the 10-fold cross-validation accuracy does not result in performance improvement for this model.

Support Vector Machines (SVM)

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
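
For the linear case this hyperplane can be written explicitly, and a new point is classified by which side of it it falls on:

$$f(x) = \operatorname{sign}(w^\top x + b)$$

Kernels such as 'rbf', used below, apply the same idea after implicitly mapping the data into a higher-dimensional space.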

Decision Function: OVR

In [153]:
from sklearn.svm import SVC 


svc_ovr = SVC(decision_function_shape='ovr', kernel='rbf', C=1) #ovr = one versus rest 
svc_fit=svc_ovr.fit(x_train, y_train) #fit on training data
pred_ovr= svc_fit.predict(x_test) #predict on X_test
score_ovr_train = round(svc_fit.score(x_train, y_train),3) #predict score on X_train, Y_train
score_ovr_test = round(svc_fit.score(x_test, y_test),3) #predict score on X_test, Y_test
print(f'Score SVM (one versus rest) on training set: {score_ovr_train}')
print()
print(f'Score SVM (one versus rest) on test set: {score_ovr_test}')
print()
print(f'real Data: {y_test}') 
print()
print(f'predicted Data: {pred_ovr}')
Score SVM (one versus rest) on training set: 0.819

Score SVM (one versus rest) on test set: 0.734

real Data: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1.
 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0.
 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.
 1. 0. 1. 1. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1.
 0. 1. 1. 0. 1. 1. 1. 0. 0. 0.]

predicted Data: [0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0.
 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0.
 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1.
 0. 1. 0. 0. 0. 1. 1. 0. 0. 0.]

Decision Function: OVO

In [156]:
svc_ovo = SVC(decision_function_shape='ovo', kernel='sigmoid') 
svc_fit=svc_ovo.fit(x_train, y_train) 
pred_ovo= svc_fit.predict(x_test) 
score_ovo_train = round(svc_fit.score(x_train, y_train),3) 
score_ovo_test = round(svc_fit.score(x_test, y_test),3)
print(f'Score SVM (one versus one) on training set : {score_ovo_train}')
print()
print(f'Score SVM (one versus one) on test set: {score_ovo_test}')
print()
print(f'Real label: {y_test}') 
print()
print(f'Predicted label: {pred_ovo}') 
Score SVM (one versus one) on training set : 0.697

Score SVM (one versus one) on test set: 0.714

Real label: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1.
 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0.
 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.
 1. 0. 1. 1. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1.
 0. 1. 1. 0. 1. 1. 1. 0. 0. 0.]

Predicted label: [0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0.
 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1.
 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 1.
 0. 1. 0. 0. 1. 0. 1. 0. 0. 0.]

Iteration on Training Set Features

Features will be dropped one at a time and the new scores recorded.

In [160]:
columns = [0,1,2,3,4,5,6]

for i in columns:
    x_del = np.delete(x_train, i, 1)
    #print(f'shape for data without column {i}: {data_concat_del.shape[1]}')

    #apply Support Vector Machines with one versus rest:
    svc_ovr = SVC(decision_function_shape='ovr') #ovr = one versus rest 
    svc_fit1 = svc_ovr.fit(x_del, y_train)
    pred_ovr = svc_fit1.predict(x_del)
    score_ovr_train = round(svc_fit1.score(x_del, y_train),3)
    print(f'Score SVM (one versus rest) on training set without feature {i}: {score_ovr_train}')
    # Score: 0.8125 with SimpleImputer(strategy = 'mean')
    # Score: 0.814 with SimpleImputer(strategy = 'median')
    #print(f'target:\n {y_train}')
    #print(f'predicted SVM OVR:\n {pred_ovr}')

    #apply Support Vector Machines with one versus one:
    svc_ovo = SVC(decision_function_shape='ovo') #ovo = one versus one 
    svc_fit2 = svc_ovo.fit(x_del, y_train)
    pred_ovo = svc_fit2.predict(x_del)
    score_ovo_train = round(svc_fit2.score(x_del, y_train),3)
    print(f'Score SVM (one versus one) on training set without feature {i}: {score_ovo_train}')
    # Score: 0.814 with SimpleImputer(strategy = 'mean')
    # Score: 0.814 with SimpleImputer(strategy = 'median')
    #print(f'target:\n {y_train}')
    #print(f'predicted SVM OVO:\n {pred_ovo}')
    # conclusion: score stays the same between OVR and OVO

    print()
print("Conclusion: Score between OVR and OVO remains the same")
Score SVM (one versus rest) on training set without feature 0: 0.808
Score SVM (one versus one) on training set without feature 0: 0.808

Score SVM (one versus rest) on training set without feature 1: 0.752
Score SVM (one versus one) on training set without feature 1: 0.752

Score SVM (one versus rest) on training set without feature 2: 0.818
Score SVM (one versus one) on training set without feature 2: 0.818

Score SVM (one versus rest) on training set without feature 3: 0.811
Score SVM (one versus one) on training set without feature 3: 0.811

Score SVM (one versus rest) on training set without feature 4: 0.808
Score SVM (one versus one) on training set without feature 4: 0.808

Score SVM (one versus rest) on training set without feature 5: 0.801
Score SVM (one versus one) on training set without feature 5: 0.801

Score SVM (one versus rest) on training set without feature 6: 0.79
Score SVM (one versus one) on training set without feature 6: 0.79

Conclusion: Score between OVR and OVO remains the same

Analysis of the above results

  • The score does not change between OVR and OVO. (With only two classes, one-versus-rest and one-versus-one reduce to the same single binary classifier, so identical scores are expected.)
  • The score is highest when kernel = "rbf".
  • No difference is noticed if the simple imputer is changed from mean to median.
  • In general the score gets better when some particular features are dropped.

Iteration on Test Set Features

Features will be dropped one at a time and the new scores recorded.

In [166]:
columns = [0,1,2,3,4,5,6]

for i in columns:
    x_del = np.delete(x_test, i, 1)
    #print(f'shape for data without column {i}: {data_concat_del.shape[1]}')

    #apply Support Vector Machines with one versus rest:
    svc_ovr = SVC(decision_function_shape='ovr') #ovr = one versus rest 
    svc_fit1 = svc_ovr.fit(x_del, y_test)
    pred_ovr = svc_fit1.predict(x_del)
    score_ovr_test = round(svc_fit1.score(x_del, y_test),3)
    print(f'Score SVM (one versus rest) on test set without feature {i}: {score_ovr_test}')
    # Score: 0.8125 with SimpleImputer(strategy = 'mean')
    # Score: 0.814 with SimpleImputer(strategy = 'median')
    #print(f'target:\n {y_train}')
    #print(f'predicted SVM OVR:\n {pred_ovr}')

    #apply Support Vector Machines with one versus one:
    svc_ovo = SVC(decision_function_shape='ovo') #ovo = one versus one 
    svc_fit2 = svc_ovo.fit(x_del, y_test)
    pred_ovo = svc_fit2.predict(x_del)
    score_ovo_test = round(svc_fit2.score(x_del, y_test),3)
    print(f'Score SVM (one versus one) on test set without feature {i}: {score_ovo_test}')
    # Score: 0.814 with SimpleImputer(strategy = 'mean')
    # Score: 0.814 with SimpleImputer(strategy = 'median')
    #print(f'target:\n {y_train}')
    #print(f'predicted SVM OVO:\n {pred_ovo}')
    # conclusion: score stays the same between OVR and OVO

    print()
print("Conclusion: Score between OVR and OVO remains the same")
Score SVM (one versus rest) on test set without feature 0: 0.844
Score SVM (one versus one) on test set without feature 0: 0.844

Score SVM (one versus rest) on test set without feature 1: 0.812
Score SVM (one versus one) on test set without feature 1: 0.812

Score SVM (one versus rest) on test set without feature 2: 0.844
Score SVM (one versus one) on test set without feature 2: 0.844

Score SVM (one versus rest) on test set without feature 3: 0.838
Score SVM (one versus one) on test set without feature 3: 0.838

Score SVM (one versus rest) on test set without feature 4: 0.825
Score SVM (one versus one) on test set without feature 4: 0.825

Score SVM (one versus rest) on test set without feature 5: 0.844
Score SVM (one versus one) on test set without feature 5: 0.844

Score SVM (one versus rest) on test set without feature 6: 0.844
Score SVM (one versus one) on test set without feature 6: 0.844

Conclusion: Score between OVR and OVO remains the same

Observation

  • The score does not change between OVR and OVO.
  • The score is highest when kernel = "rbf".
  • No difference is noticed if the simple imputer is changed from mean to median.
  • In general the score gets better when some particular features are dropped.
  • The test score is better than the training score. (This is not surprising here: in this iteration the models are fitted and scored on the test set itself, which inflates the scores.)

Validation Curve SVM

In [170]:
from sklearn.model_selection import validation_curve

# note: gamma = 0 makes the RBF kernel constant (the model then just predicts the
# majority class), so the first point of the curve is degenerate; a strictly
# positive range such as np.logspace(-3, 1, 5) would be more informative
param_range = range(0, 5, 1)

train_scores, test_scores = validation_curve(SVC(), x_train, y_train, param_name = "gamma", scoring="accuracy", param_range=param_range, cv=10)

train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)

plt.plot(param_range, train_scores_mean, label="Training score", color="darkorange")
plt.plot(param_range, test_scores_mean, label="Test score", color="blue")
plt.ylabel("Score")

plt.legend(loc='best')
plt.show()
[Figure: validation curve of training and test accuracy over gamma]

Cross Validation SVM

In [173]:
from sklearn.model_selection import cross_val_score
#assuming that SVC is already imported from sklearn

svc=SVC()

# note: cv = 300 on ~614 training rows leaves only 2-3 samples per fold, which is
# why the per-fold scores below take only coarse values such as 0, 0.5 and 1
scores = cross_val_score(svc, x_train, y_train, cv = 300, scoring='accuracy')

avg_score = round(scores.mean(),2)

print('Cross-validation scores:')
print(scores)
print()

print('Average cross-validation score:')
print(avg_score)
Cross-validation scores:
[0.66666667 1.         1.         0.66666667 1.         1.
 0.66666667 1.         0.66666667 0.66666667 1.         0.66666667
 0.66666667 1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         0.5        1.         1.
 1.         1.         0.5        1.         1.         1.
 0.         1.         0.5        1.         1.         1.
 1.         1.         1.         1.         0.5        1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         0.5        0.5        1.         1.         1.
 1.         0.5        1.         1.         1.         1.
 1.         1.         1.         1.         1.         0.5
 1.         0.5        1.         1.         0.5        0.5
 1.         0.5        1.         0.5        1.         1.
 1.         1.         1.         0.5        0.5        1.
 1.         0.5        0.5        0.5        1.         1.
 1.         1.         0.5        0.5        1.         0.5
 1.         0.5        1.         0.5        1.         1.
 1.         1.         0.5        1.         0.5        0.
 1.         0.5        1.         0.5        0.5        1.
 0.         1.         1.         1.         0.         0.5
 1.         1.         1.         0.5        0.5        1.
 1.         1.         1.         1.         1.         0.5
 1.         1.         0.5        1.         1.         0.5
 1.         0.5        1.         0.5        1.         1.
 1.         0.5        0.5        1.         1.         0.5
 1.         1.         0.         0.5        1.         0.5
 0.5        0.5        1.         0.5        1.         1.
 1.         0.5        0.5        0.         0.5        1.
 1.         0.5        0.         0.5        1.         0.5
 1.         0.5        0.5        1.         0.5        0.5
 0.5        0.5        1.         0.5        1.         0.5
 0.5        0.5        0.5        0.5        0.5        0.
 1.         1.         0.5        0.5        0.         0.
 0.5        0.5        1.         1.         1.         0.5
 1.         0.5        0.5        1.         1.         1.
 1.         0.5        0.5        1.         0.5        1.
 0.         0.5        0.5        1.         1.         1.
 0.5        0.5        1.         0.         1.         0.
 1.         1.         1.         0.5        1.         1.
 1.         0.5        1.         0.5        1.         1.
 0.5        0.         1.         1.         0.5        1.
 0.5        1.         1.         0.5        0.         0.
 1.         0.5        0.5        0.5        0.5        0.5
 0.5        0.5        0.5        0.5        0.5        1.
 1.         1.         0.         1.         1.         0.5
 0.5        0.5        1.         0.5        1.         1.
 0.5        1.         1.         0.5        0.5        1.
 1.         0.5        0.5        0.5        1.         0.5       ]

Average cross-validation score:
0.76

Grid Search SVM

In [177]:
from sklearn.model_selection import GridSearchCV

svc=SVC() #assuming SVC has been imported

# parameters for hyperparameter tuning
parameters = [ {'C':[1, 10, 100, 1000], 'kernel':['linear']},
{'C':[1, 10, 100, 1000], 'kernel':['rbf'], 'gamma':[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]},
{'C':[1, 10, 100, 1000], 'kernel':['poly'], 'degree': [2,3,4] ,'gamma':[0.01,0.02,0.03,0.04,0.05]} 
]


grid_search_svm = GridSearchCV(estimator = svc, param_grid = parameters, scoring = 'accuracy', cv = 10, verbose=0)
grid_search_svm.fit(x_train, y_train)

grid_search_svm_test = GridSearchCV(estimator = svc, param_grid = parameters, scoring = 'accuracy', cv = 10, verbose=0)
grid_search_svm_test.fit(x_test, y_test)

print('Highest Gridsearch-CV score on training set:')
bestscoretrain=grid_search_svm.best_score_
print(round(bestscoretrain, 3))
print()

print('Highest Gridsearch-CV score on test set:')
bestscoretest=grid_search_svm_test.best_score_
print(round(bestscoretest, 3))
print()

print('Parameter for train that gives the best result:')
print(grid_search_svm.best_params_)
print()

print('Parameter for test that gives the best result:')
print(grid_search_svm_test.best_params_)
print()

print('Estimator for train chosen by the grid search CV:')
print(grid_search_svm.best_estimator_)
print()

print('Estimator for test chosen by the grid search CV:')
print(grid_search_svm_test.best_estimator_)
print()
Highest Gridsearch-CV score on training set:
0.76

Highest Gridsearch-CV score on test set:
0.774

Parameter for train that gives the best result:
{'C': 1, 'kernel': 'linear'}

Parameter for test that gives the best result:
{'C': 10, 'gamma': 0.6, 'kernel': 'rbf'}

Estimator for train chosen by the grid search CV:
SVC(C=1, kernel='linear')

Estimator for test chosen by the grid search CV:
SVC(C=10, gamma=0.6)

Decision Trees

Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
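
The criterion='entropy' option used in the cells below chooses splits that maximize information gain, where a node with class proportions $p_c$ has entropy

$$H = -\sum_{c} p_c \log_2 p_c,$$

so a pure node has $H = 0$ and a 50/50 node has $H = 1$ bit.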

We will be testing this model for the following reasons:

  • Decision trees mimic human decision-making, so they are easy to understand.
  • The logic behind a decision tree is easy to follow because of its tree-like structure.
  • It follows the same process a human follows when making a decision in real life.
  • It can be very useful for solving decision-related problems.
  • It helps to think about all the possible outcomes for a problem.
  • There is less need for data cleaning compared to other algorithms.

In [184]:
from sklearn import tree

dctree = tree.DecisionTreeClassifier(random_state=4, criterion='entropy')
dctree_fit = dctree.fit(x_train, y_train)
pred_dctree_train = dctree.predict(x_train)
pred_dctree_test = dctree.predict(x_test)
score_dctree_train = round(dctree.score(x_train,y_train),3)
score_dctree_test = round(dctree.score(x_test,y_test),3)
print(f'Decision Tree Score on Trainingset: {score_dctree_train}')
print()
print(f'Decision Tree Score on Test set: {score_dctree_test}')
print()
print(f'Training label real:')
print(y_train)
print()
print(f'Training label predicted:')
print(pred_dctree_train)
print()
print(f'Test label real:')
print(y_train)   # note: the original cell prints y_train again here; it should be y_test
print()
print(f'Test data predicted:')
print(pred_dctree_test)
print()
Decision Tree Score on Trainingset: 1.0

Decision Tree Score on Test set: 0.662

Training label real:
[0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1.
 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1.
 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0.
 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1.
 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0.
 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0.
 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0.
 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0.
 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0.
 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0.
 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1.
 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1.
 1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0.
 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.]

Training label predicted:
[0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1.
 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1.
 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0.
 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1.
 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0.
 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0.
 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0.
 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0.
 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0.
 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0.
 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1.
 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1.
 1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0.
 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.]

Test label real:
[0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1.
 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1.
 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0.
 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1.
 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0.
 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0.
 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0.
 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0.
 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0.
 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0.
 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1.
 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1.
 1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0.
 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.]

Test data predicted:
[0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1.
 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 0. 0.
 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 1.
 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1.
 1. 1. 0. 0. 0. 0. 0. 0. 0. 1.]

Result 1

We see a clear case of overfitting here, as the training score (1.0) and test score (0.662) are far apart.
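
One common remedy, not explored in the original run, is to cap the depth of the tree. A sketch (max_depth=4 is an illustrative value, not a tuned one):

# a shallower tree cannot memorize the training set, which should narrow the
# gap between the training and test scores seen above
dctree_small = tree.DecisionTreeClassifier(random_state=4, criterion='entropy', max_depth=4)
dctree_small.fit(x_train, y_train)
print('train score:', round(dctree_small.score(x_train, y_train), 3))
print('test score: ', round(dctree_small.score(x_test, y_test), 3))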

Cross Validation Decision Trees

In [189]:
from sklearn.model_selection import cross_val_score
#assuming here that tree has been imported from sklearn

dtree=tree.DecisionTreeClassifier(random_state=4, criterion='entropy', min_samples_split=3) #same parameters as above

# note: this passes `svc` rather than `dtree`, so the scores below are the SVM's
# cross-validation scores repeated; pass dtree here to cross-validate the tree
scores = cross_val_score(svc, x_train, y_train, cv = 300, scoring='accuracy')

avg_score = round(scores.mean(),2)

print('Cross-validation scores:')
print(scores)
print()

print('Average cross-validation score:')
print(avg_score)
Cross-validation scores:
[0.66666667 1.         1.         0.66666667 1.         1.
 0.66666667 1.         0.66666667 0.66666667 1.         0.66666667
 0.66666667 1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         0.5        1.         1.
 1.         1.         0.5        1.         1.         1.
 0.         1.         0.5        1.         1.         1.
 1.         1.         1.         1.         0.5        1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         0.5        0.5        1.         1.         1.
 1.         0.5        1.         1.         1.         1.
 1.         1.         1.         1.         1.         0.5
 1.         0.5        1.         1.         0.5        0.5
 1.         0.5        1.         0.5        1.         1.
 1.         1.         1.         0.5        0.5        1.
 1.         0.5        0.5        0.5        1.         1.
 1.         1.         0.5        0.5        1.         0.5
 1.         0.5        1.         0.5        1.         1.
 1.         1.         0.5        1.         0.5        0.
 1.         0.5        1.         0.5        0.5        1.
 0.         1.         1.         1.         0.         0.5
 1.         1.         1.         0.5        0.5        1.
 1.         1.         1.         1.         1.         0.5
 1.         1.         0.5        1.         1.         0.5
 1.         0.5        1.         0.5        1.         1.
 1.         0.5        0.5        1.         1.         0.5
 1.         1.         0.         0.5        1.         0.5
 0.5        0.5        1.         0.5        1.         1.
 1.         0.5        0.5        0.         0.5        1.
 1.         0.5        0.         0.5        1.         0.5
 1.         0.5        0.5        1.         0.5        0.5
 0.5        0.5        1.         0.5        1.         0.5
 0.5        0.5        0.5        0.5        0.5        0.
 1.         1.         0.5        0.5        0.         0.
 0.5        0.5        1.         1.         1.         0.5
 1.         0.5        0.5        1.         1.         1.
 1.         0.5        0.5        1.         0.5        1.
 0.         0.5        0.5        1.         1.         1.
 0.5        0.5        1.         0.         1.         0.
 1.         1.         1.         0.5        1.         1.
 1.         0.5        1.         0.5        1.         1.
 0.5        0.         1.         1.         0.5        1.
 0.5        1.         1.         0.5        0.         0.
 1.         0.5        0.5        0.5        0.5        0.5
 0.5        0.5        0.5        0.5        0.5        1.
 1.         1.         0.         1.         1.         0.5
 0.5        0.5        1.         0.5        1.         1.
 0.5        1.         1.         0.5        0.5        1.
 1.         0.5        0.5        0.5        1.         0.5       ]

Average cross-validation score:
0.76

Grid search Decision Trees

In [192]:
from sklearn.model_selection import GridSearchCV

dtree=tree.DecisionTreeClassifier()
dtree.get_params().keys()


parameters = [{'criterion' : ['gini','entropy','log_loss'], 'min_samples_split' : [2,3,4]}
             ]


grid_search_svm = GridSearchCV(estimator = dtree, param_grid = parameters, scoring = 'accuracy', cv = 10, verbose=0)
grid_search_svm.fit(x_train, y_train)

grid_search_svm_test = GridSearchCV(estimator = dtree, param_grid = parameters, scoring = 'accuracy', cv = 10, verbose=0)
grid_search_svm_test.fit(x_test, y_test)

print('Highest Gridsearch-CV score on training set:')
bestscoretrain=grid_search_svm.best_score_
print(round(bestscoretrain, 3))
print()

print('Highest Gridsearch-CV score on test set:')
bestscoretest=grid_search_svm_test.best_score_
print(round(bestscoretest, 3))
print()

print('Parameter for train that gives the best result:')
print(grid_search_svm.best_params_)
print()

print('Parameter for test that gives the best result:')
print(grid_search_svm_test.best_params_)
print()

print('Estimator for train chosen by the grid search CV:')
print(grid_search_svm.best_estimator_)
print()

print('Estimator for test chosen by the grid search CV:')
print(grid_search_svm_test.best_estimator_)
print()
Highest Gridsearch-CV score on training set:
0.721

Highest Gridsearch-CV score on test set:
0.657

Parameter for train that gives the best result:
{'criterion': 'gini', 'min_samples_split': 4}

Parameter for test that gives the best result:
{'criterion': 'gini', 'min_samples_split': 2}

Estimator for train chosen by the grid search CV:
DecisionTreeClassifier(min_samples_split=4)

Estimator for test chosen by the grid search CV:
DecisionTreeClassifier()

Logistic Regression

We use Scikit-learn's SGD classifier to perform logistic regression:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html. This makes sense because the target column Outcome holds a categorical value (0 or 1).

loss='log' stands for logistic regression; alpha is the regularization term (the higher the value, the stronger the regularization); eta0 is the initial learning rate for the 'constant', 'invscaling' or 'adaptive' learning_rate; max_iter is the maximum number of passes over the training data (also called epochs). (Note that the cells below actually use scikit-learn's LogisticRegression class directly rather than SGDClassifier.)
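
Concretely, logistic regression models the probability of the positive class as the sigmoid of a linear score:

$$P(\text{Outcome} = 1 \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}}$$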

We define the model:

In [197]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
In [199]:
log_reg.fit(x_train , y_train)
Out[199]:
LogisticRegression()
In [201]:
print("logistic Regression Coef ", log_reg.coef_)
print("logistic Regression Intercept ", log_reg.intercept_)
print("Score on Training set " ,log_reg.score(x_train, y_train))
print("Score on Test set" ,log_reg.score(x_test, y_test))
logistic Regression Coef  [[ 0.45974335  1.11611972 -0.03398345 -0.00765013 -0.052566    0.57997652
   0.31562067]]
logistic Regression Intercept  [-0.84124214]
Score on Training set  0.7687296416938111
Score on Test set 0.7792207792207793

Observation

We see that our training score is 0.77 while our test score is 0.78.

We will use cross-validation to try to improve our model.

In [204]:
# Crossvalscore on Train set
from sklearn.model_selection import cross_val_score
scores = cross_val_score(log_reg, x_train, y_train, cv = 10, scoring='accuracy')
print('Cross-validation scores on Train set:{}'.format(scores))
print('Average cross-validation score on Train set: {:.4f}'.format(scores.mean()))
Cross-validation scores on Train set:[0.79032258 0.80645161 0.80645161 0.72580645 0.6557377  0.75409836
 0.80327869 0.75409836 0.7704918  0.7704918 ]
Average cross-validation score on Train set: 0.7637

Observation

We see no improvement in our scores after cross validation

Using Gridsearch

In [208]:
# Grid search cross validation on Training set
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}# l1 lasso l2 ridge
logreg=LogisticRegression()
logreg_cv=GridSearchCV(logreg,grid,cv=10)
logreg_cv.fit(x_train,y_train)

print("tuned hpyerparameters :(best parameters) on Training set ",logreg_cv.best_params_)
print("accuracy on Training set :",logreg_cv.best_score_)
tuned hyperparameters (best parameters) on Training set  {'C': 100.0, 'penalty': 'l2'}
accuracy on Training set : 0.7669751454257007

Results and Conclusion from Logistic Regression

There are no improvements on the model after the parameters have been tuned.

Random Forest Classifier

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.

This model was chosen for the following reasons:

  • It takes less training time as compared to other algorithms.
  • It predicts output with high accuracy, even for the large dataset it runs efficiently.
  • It can also maintain accuracy when a large proportion of data is missing.
  • A greater number of trees in the forest tends to increase accuracy and reduce the risk of overfitting.
  • With the help of this algorithm, disease trends and disease risks can be identified; in other words, this model is widely used in the health sector.
In [215]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=300, bootstrap = True, max_features = 'sqrt')
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
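
The fitted forest also exposes per-feature importances, which ties back to the project goal of identifying the most predictive features. A sketch (the names follow the column order of x; Age is absent because the 0:7 slice excluded it):

# rank features by the forest's impurity-based importance
feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                 'Insulin', 'BMI', 'DiabetesPedigreeFunction']
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(f'{name}: {imp:.3f}')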
In [217]:
model.score(x_train,y_train)
Out[217]:
1.0
In [219]:
print('Accuracy of Random Forest on test set: {:.2f}'.format(model.score(x_test, y_test)))
Accuracy of Random Forest on test set: 0.76

Observation

From the above results there appears to be overfitting:

  • There is a big difference between our train and test accuracy scores.
  • The train set score is 100%.

Before we confirm the scores and deal with the issue of overfitting, we will do some more investigation below.

In [226]:
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Training Score:\n",model.score(x_train,y_train)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
Classification Report is:
               precision    recall  f1-score   support

         0.0       0.79      0.85      0.82       100
         1.0       0.68      0.59      0.63        54

    accuracy                           0.76       154
   macro avg       0.74      0.72      0.73       154
weighted avg       0.75      0.76      0.76       154

Confusion Matrix:
 [[85 15]
 [22 32]]
Training Score:
 100.0
Mean Squared Error:
 0.24025974025974026
R2 score is:
 -0.05518518518518545

Observation

This implies our model classified correctly 76% of the time on our test set.

The precision score (macro avg) stood at 0.74, implying our model placed observations with a high risk of diabetes in the high-risk category 74% of the time. The recall (macro avg) stood at 0.72.

We also have an F1 score of 0.73. The F1 score is the harmonic mean of precision and recall and assigns equal weight to both metrics. However, for our analysis it is relatively more important for the model to have few false negatives (it would be dangerous to classify high-risk patients in the low-risk category). Therefore, we look at precision and recall individually.
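
One way to push false negatives down, not tried in the original run, is to reweight the minority diabetic class. A sketch (class_weight='balanced' is an illustrative choice, not a tuned one):

# 'balanced' weights classes inversely to their frequency, typically trading
# some false positives for fewer false negatives
rf_weighted = RandomForestClassifier(n_estimators=300, class_weight='balanced')
rf_weighted.fit(x_train, y_train)
print(confusion_matrix(y_test, rf_weighted.predict(x_test)))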

Improving our model

Using cross validation

In [233]:
scores = cross_val_score(model, x_train, y_train, cv = 20, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))
Cross-validation scores:[0.80645161 0.77419355 0.90322581 0.74193548 0.64516129 0.80645161
 0.61290323 0.80645161 0.67741935 0.58064516 0.67741935 0.70967742
 0.80645161 0.74193548 0.86666667 0.7        0.76666667 0.7
 0.73333333 0.86666667]

Observation

After performing cross-validation on our model, we see no improvement, so we move on to parameter tuning with grid search.

With Grid search

In [238]:
#First Gridsearch

from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier()

# declare parameters for hyperparameter tuning
parameters = [ {'n_estimators':[100, 200, 300, 400], 'criterion':['gini']},
               {'n_estimators':[100, 200, 300, 400], 'criterion':['entropy'], 'max_features':['sqrt']},
               {'n_estimators':[100, 200, 300, 400], 'criterion':['log_loss'], 'max_features': ['log2']} 
              ]




grid_search = GridSearchCV(estimator = model,  
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 3 
                           )


grid_search.fit(x_train, y_train)
Out[238]:
GridSearchCV(cv=3, estimator=RandomForestClassifier(),
             param_grid=[{'criterion': ['gini'],
                          'n_estimators': [100, 200, 300, 400]},
                         {'criterion': ['entropy'], 'max_features': ['sqrt'],
                          'n_estimators': [100, 200, 300, 400]},
                         {'criterion': ['log_loss'], 'max_features': ['log2'],
                          'n_estimators': [100, 200, 300, 400]}],
             scoring='accuracy')
In [240]:
print('GridSearch CV best score : {:.4f}\n\n'.format(grid_search.best_score_))


# print parameters that give the best results
print('Parameters that give the best results :','\n\n', (grid_search.best_params_))


# print estimator that was chosen by the GridSearch
print('\n\nEstimator that was chosen by the search :','\n\n', (grid_search.best_estimator_))
GridSearch CV best score : 0.7476


Parameters that give the best results : 

 {'criterion': 'gini', 'n_estimators': 100}


Estimator that was chosen by the search : 

 RandomForestClassifier()
In [242]:
# second grid search

from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier()

# declare parameters for hyperparameter tuning
parameters = [ {'n_estimators':[100, 200, 300, 400], 'criterion':['gini'],'max_features':['log2'],
                },
               {'n_estimators':[100, 200, 300, 400], 'criterion':['entropy'], 'max_features':['sqrt'],
                'max_depth':[3,5],'min_samples_split':[2,3],'random_state':[50]},
               {'n_estimators':[100, 200, 300, 400], 'criterion':['log_loss'], 'max_features': ['log2']} 
              ]




grid_search = GridSearchCV(estimator = model,  
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 3 
                           )


grid_search.fit(x_train, y_train)
Out[242]:
GridSearchCV(cv=3, estimator=RandomForestClassifier(),
             param_grid=[{'criterion': ['gini'], 'max_features': ['log2'],
                          'n_estimators': [100, 200, 300, 400]},
                         {'criterion': ['entropy'], 'max_depth': [3, 5],
                          'max_features': ['sqrt'], 'min_samples_split': [2, 3],
                          'n_estimators': [100, 200, 300, 400],
                          'random_state': [50]},
                         {'criterion': ['log_loss'], 'max_features': ['log2'],
                          'n_estimators': [100, 200, 300, 400]}],
             scoring='accuracy')
In [243]:
print('GridSearch CV best score : {:.4f}\n\n'.format(grid_search.best_score_))


# print parameters that give the best results
print('Parameters that give the best results :','\n\n', (grid_search.best_params_))


# print estimator that was chosen by the GridSearch
print('\n\nEstimator that was chosen by the search :','\n\n', (grid_search.best_estimator_))
GridSearch CV best score : 0.7574


Parameters that give the best results : 

 {'criterion': 'entropy', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 100, 'random_state': 50}


Estimator that was chosen by the search : 

 RandomForestClassifier(criterion='entropy', max_depth=5, random_state=50)

Observation

Since the second grid search gives a better CV score (0.7574 vs 0.7476), let us try using these parameters on our model.
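Before retraining by hand, note that GridSearchCV refits the best configuration on the full training set by default (refit=True), so the tuned model is already available as grid_search.best_estimator_. A minimal equivalent shortcut:

In [ ]:
# best_estimator_ is already fitted on the whole training set,
# so it can be scored directly without re-declaring the parameters
best_model = grid_search.best_estimator_
print('Tuned test accuracy: {:.2f}'.format(best_model.score(x_test, y_test)))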

In [248]:
# inspecting the cross-validated results of the grid search
gs = pd.DataFrame(grid_search.cv_results_)
gs.head()
Out[248]:
mean_fit_time std_fit_time mean_score_time std_score_time param_criterion param_max_features param_n_estimators param_max_depth param_min_samples_split param_random_state params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.148981 0.055053 0.006583 0.006598 gini log2 100 NaN NaN NaN {'criterion': 'gini', 'max_features': 'log2', ... 0.765854 0.692683 0.754902 0.737813 0.032223 20
1 0.204700 0.007691 0.010356 0.004044 gini log2 200 NaN NaN NaN {'criterion': 'gini', 'max_features': 'log2', ... 0.765854 0.692683 0.745098 0.734545 0.030790 24
2 0.320442 0.006110 0.010796 0.000366 gini log2 300 NaN NaN NaN {'criterion': 'gini', 'max_features': 'log2', ... 0.785366 0.692683 0.740196 0.739415 0.037842 19
3 0.428244 0.008173 0.013839 0.000479 gini log2 400 NaN NaN NaN {'criterion': 'gini', 'max_features': 'log2', ... 0.760976 0.702439 0.754902 0.739439 0.026280 18
4 0.116224 0.020491 0.008416 0.004997 entropy sqrt 100 3.0 2.0 50.0 {'criterion': 'entropy', 'max_depth': 3, 'max_... 0.770732 0.721951 0.750000 0.747561 0.019989 13
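gs.head() only shows the first five configurations in grid order; sorting the frame by rank makes the winning configurations visible. A small sketch:

In [ ]:
# rank the grid-search configurations by mean cross-validated accuracy
top = gs.sort_values('rank_test_score')[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
top.head()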
In [250]:
# testing with the new parameters

model = RandomForestClassifier(criterion='entropy', max_depth=5, max_features='sqrt',
                       random_state=50)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
In [252]:
model.score(x_train,y_train)
Out[252]:
0.8355048859934854
In [254]:
print('Accuracy of Random Forest on test set: {:.2f}'.format(model.score(x_test, y_test)))
Accuracy of Random Forest on test set: 0.76
In [256]:
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Training Score:\n",model.score(x_train,y_train)*100)
print("Mean Squared Error:\n",mean_squared_error(y_test,y_pred))
print("R2 score is:\n",r2_score(y_test,y_pred))
Classification Report is:
               precision    recall  f1-score   support

         0.0       0.77      0.90      0.83       100
         1.0       0.73      0.50      0.59        54

    accuracy                           0.76       154
   macro avg       0.75      0.70      0.71       154
weighted avg       0.76      0.76      0.75       154

Confusion Matrix:
 [[90 10]
 [27 27]]
Training Score:
 83.55048859934854
Mean Squared Error:
 0.24025974025974026
R2 score is:
 -0.05518518518518545

Observation

After our grid search we get a new training-set score of 83.5%, which is lower than 100% and a much better sign that the model is no longer overfitting. The test accuracy remains at 76%. We can make this comparison explicit via the train-test gap, as sketched below.
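The train-test accuracy gap shrinks from roughly 0.24 (1.00 vs 0.76) before tuning to roughly 0.08 (0.84 vs 0.76) after it. A minimal check on the current model:

In [ ]:
# a large gap means the model memorises the training data;
# after tuning the gap should be small
gap = model.score(x_train, y_train) - model.score(x_test, y_test)
print('Train-test accuracy gap: {:.2f}'.format(gap))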

Checking for Feature Importance with the Random Forest Classifier

In [261]:
model.feature_importances_
Out[261]:
array([0.08903637, 0.37324672, 0.06128705, 0.05922595, 0.10008714,
       0.20765149, 0.1094653 ])
In [263]:
# feature names (first seven columns of the cleaned dataframe)
columns_u = df_NAN.iloc[:, 0:7].columns
plt.barh(columns_u, model.feature_importances_)
Out[263]:
<BarContainer object of 7 artists>
[Figure: horizontal bar chart of the Random Forest feature importances per column]

Observation

The figure above shows the relative importance of each feature and its contribution to the model. Since this is a small dataset with few columns, I did not use a dimensionality-reduction technique such as PCA.

However, we will try removing the least important columns, BloodPressure and/or SkinThickness, and see whether it makes any difference with the best parameters from our grid search.
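Pairing each importance with its column name makes the ranking explicit, so the candidates for removal stand out. A minimal sketch reusing columns_u and the fitted model from above:

In [ ]:
# sort the features by their importance to the fitted forest;
# BloodPressure and SkinThickness sit at the bottom of this ranking
importances = pd.Series(model.feature_importances_, index=columns_u)
importances.sort_values(ascending=False)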

In [267]:
# define new train/test sets without BloodPressure (column index 2)

x_train_new = np.delete(x_train, [2], axis=1)
x_test_new = np.delete(x_test, [2], axis=1)
In [269]:
x_train_new.shape
Out[269]:
(614, 6)
In [271]:
x_train.shape
Out[271]:
(614, 7)
In [273]:
# retrain with the tuned parameters on the reduced feature set
model_new = RandomForestClassifier(criterion='entropy', max_depth=5,
                                   max_features='sqrt', random_state=50)
model_new.fit(x_train_new, y_train)
y_pred = model_new.predict(x_test_new)
In [275]:
model_new.score(x_train_new,y_train)
Out[275]:
0.8371335504885994
In [277]:
print('Accuracy of Random Forest on test set: {:.2f}'.format(model_new.score(x_test_new, y_test)))
Accuracy of Random Forest on test set: 0.78

Observation

When we remove the least important column, the test accuracy only increases slightly, from 0.76 to 0.78. A sketch for dropping SkinThickness as well follows below.
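The cells above only dropped BloodPressure (column index 2). As a follow-up sketch, under the assumption that the column order matches the original dataframe, dropping SkinThickness (index 3) as well might look like this:

In [ ]:
# drop both of the least important features at once:
# BloodPressure (index 2) and SkinThickness (index 3)
x_train_two = np.delete(x_train, [2, 3], axis=1)
x_test_two = np.delete(x_test, [2, 3], axis=1)

model_two = RandomForestClassifier(criterion='entropy', max_depth=5,
                                   max_features='sqrt', random_state=50)
model_two.fit(x_train_two, y_train)
print('Test accuracy without both columns: {:.2f}'.format(
    model_two.score(x_test_two, y_test)))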

General conclusion¶

Was our Hypothesis correct?

Based on our feature importance:

Glucose is the most important factor in determining the occurrence of diabetes, followed by BMI and age. Other factors such as the diabetes pedigree function, pregnancies, blood pressure, skin thickness and insulin also contribute to the prediction. These results make sense: glucose level is one of the first things monitored in patients at high risk of diabetes.

The risk also increases as a person gets older.

Can the result be trusted?

Based on a maximum training score of around 90% reported for this dataset in our research, we can trust results whose training scores are around 80%. A 100% training score, as we saw with Random Forest before tuning and with neural networks, is clear overfitting, and the corresponding test scores are comparatively low; we would not trust those scores.

To conclude: we can choose the Random Forest classifier or the SVM as the right model due to their high accuracy, precision and recall scores. One reason the Random Forest classifier showed improved performance is the presence of outliers in the data: because Random Forest splits on feature thresholds rather than on distances, it is not heavily influenced by outliers, while algorithms that are sensitive to feature scale and extreme values, such as Logistic Regression, showed lower performance.
